7 - Binary Classification with Support Vector Machines
By Patrick Nichols, Bobbie-Jo Webb-Robertson, and Christopher Oehmen, Pacific Northwest National Laboratory
Edited by Ian Gorton and Deborah K. Gracio
Book: Data-Intensive Computing
Published online: 05 December 2012
Print publication: 29 October 2012, pp. 157-179

Summary
Introduction
Support vector machines (SVMs) are currently among the most popular and accurate methods for binary data classification and prediction. They have been applied to a wide variety of data and problems, including cyber-security, bioinformatics, web search, medical risk assessment, and financial analysis [1]. This type of machine learning has been shown to be accurate and able to generalize predictions from previously learned patterns. However, current implementations are limited: they can typically be trained accurately only on example sets numbering in the tens of thousands, and they usually run only on serial computers. There are exceptions. A prime example is the annual machine learning and classification competitions, such as those held at the International Conference on Artificial Neural Networks (ICANN), which present problems with more than 100,000 elements to be classified. To treat such large test cases, however, the formalism of the support vector machine must be modified.
SVMs were first developed by Vapnik and collaborators [2] as an extension of neural networks. Assume that the data values associated with an entity can be converted into numerical values that form a vector in the mathematical sense; these vectors form a space. Assume further that this space of vectors can be separated by a hyperplane into the vectors that belong to one class and those that belong to the opposing class.
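The separating-hyperplane idea above can be illustrated with a minimal sketch (an assumed illustration, not the chapter's own code): given a hyperplane defined by a weight vector w and offset b, a point x is assigned to one class or the other according to the sign of w·x + b.

```python
# Minimal sketch of hyperplane-based classification (illustrative only;
# the weights here are chosen by hand, not learned as an SVM would do).

def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane
    w·x + b = 0 the point x falls on."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

# Toy hyperplane x1 + x2 - 1 = 0 separating two classes of 2-D points.
w, b = [1.0, 1.0], -1.0
print(classify(w, b, [2.0, 2.0]))  # prints 1  (above the hyperplane)
print(classify(w, b, [0.0, 0.0]))  # prints -1 (below the hyperplane)
```

Training an SVM amounts to choosing the w and b that separate the two classes with the maximum margin; the hand-picked values above stand in for that learned result.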
9 - Let the Data Do the Talking: Hypothesis Discovery from Large-Scale Data Sets in Real Time
By Christopher Oehmen and Scott Dowson (Pacific Northwest National Laboratory), Wes Hatley (Future Point Systems), and Justin Almquist, Bobbie-Jo Webb-Robertson, Jason McDermott, Ian Gorton, and Lee Ann McCue (Pacific Northwest National Laboratory)
Edited by Ian Gorton and Deborah K. Gracio
Book: Data-Intensive Computing
Published online: 05 December 2012
Print publication: 29 October 2012, pp. 235-257

Summary
Discovering Biological Mechanisms through Exploration
The availability of massive amounts of data in the biological sciences is forcing us to rethink the role of hypothesis-driven investigation in modern research. Soon thousands, if not millions, of whole-genome DNA and protein sequence data sets will be available, thanks to continued improvements in high-throughput sequencing and analysis technologies. At the same time, high-throughput experimental platforms for gene expression, protein and protein fragment measurements, and other assays are driving experimental data sets to extreme scales. As a result, the biological sciences are undergoing a paradigm shift from hypothesis-driven to data-driven scientific exploration.

In hypothesis-driven research, one begins with observations, formulates a hypothesis, then tests that hypothesis in controlled experiments. In a data-rich environment, however, one often begins with only a cursory hypothesis (such as "some class of molecular components is related to a cellular process") that may require rapidly evaluating hundreds or thousands of specific hypotheses. Performing this many experiments physically is generally intractable. Often, however, existing data can be brought to bear to rapidly evaluate and refine these candidate hypotheses into a small number of testable ones. Moreover, the amount of data required to discover and refine a hypothesis in this way often overwhelms conventional analysis software and hardware. Advanced hardware can help, but conventional batch-mode access models for high-performance computing are not amenable to real-time analysis in larger workflows. We present a model for a real-time, data-intensive hypothesis discovery process that unites parallel software applications, high-performance hardware, and visual representation of the output.